feat(audit): default-on integration drift detection by danielmeppiel · Pull Request #1137 · microsoft/apm

danielmeppiel · 2026-05-04T21:39:01Z

feat(audit): default-on integration drift detection

TL;DR

apm audit now detects integration drift by default. It replays apm install cache-only into a scratch tree and diffs against your working copy, catching the three cases that previously slipped past CI: .apm/ source added without re-running install (#1067), hand-edited deployed files, and orphan files left after a dependency was removed. The scan is read-only — nothing in the project, lockfile, or apm_modules/ is touched. A new .apm-pin cache marker, written by apm install, defends drift from the stale-cache failure mode (lockfile bumped → cache not refreshed).

Note

Closes #1071. Closes #898 — the integration-drift rail completes the epic; the per-file hash rail was delivered by #762/#889, and cross-skill dependency closure is already supported via the existing transitive resolver (skill bundles ship their own apm.yml). Review-cycle status: all Copilot inline comments (A1–A7) and apm-review-panel deferred follow-ups (B1 cache-pin, B2 inverse-norm, B3 normalization extract, B4 stderr logging, B5 README link) are landed in this PR.

Problem (WHY)

Today, apm audit covers lockfile consistency but is blind to the most common real-world drift mode: a developer adds a .apm/instructions/foo.md and forgets to re-run apm install, so .github/instructions/foo.md never lands. CI only catches it because every consumer repo (including this one) maintains a hand-rolled git status --porcelain script after the install step. That workaround has three problems:

It only fires when CI happens to run apm install first, so local apm audit gives false confidence.
It cannot distinguish unintegrated (forgot to install) from modified (hand-edit) from orphaned (stale file), so the failure message is unhelpful.
It is bash glue users must copy verbatim, not a tool guarantee — a Progressive Disclosure failure: "Context arrives just-in-time, not just-in-case." The drift signal should arrive with apm audit, not as a side-channel script.

Issue #898 originally asked for full lockfile-vs-deployed-content verification; the design loop captured in WIP/drift/ widened the scope to all three drift kinds and locked the surface as a single command (apm audit) with default-on behavior.

Approach (WHAT)

Decision	Choice	Rationale
Surface	Extend `apm audit` (no new command)	One mental model: "audit = is my project consistent?"
Default	ON	Catching drift is the user-promise; opt-out, not opt-in
Engine	Cache-only install replay into scratch tmpdir	Reuses the install pipeline verbatim — no second source of truth for "what should be deployed"
Mutex	`--no-drift` mutually exclusive with `--strip` and `--file`	Strip-mode and single-file mode operate on payloads, not the project; mixing them with drift is ill-defined
Output	Top-level `drift` key in JSON; `apm/drift/` rules in SARIF	Drift is orthogonal to lockfile checks; do not encode it as a fake check entry
Failure mode	Findings exit 1 in `--ci` only	Bare `apm audit` is advisory; CI mode is the gate
Stale-cache defense	`.apm-pin` JSON marker per cached package, verified before replay	Catches "teammate bumped lockfile but I haven't reinstalled" without expensive re-resolution

Implementation (HOW)

File	Change
`src/apm_cli/install/drift.py` (new, ~430 lines)	Replay engine: `ReplayConfig`, `DriftFinding`, `CacheMissError`, `CheckLogger` (with `scratch_root()` stderr helper), `run_replay()`, `diff_scratch_against_project()`, three renderers (text/JSON/SARIF). Reads `.apm-pin` marker via `verify_marker()` before each cache-only materialization
`src/apm_cli/install/cache_pin.py` (new)	Stale-cache marker: `MARKER_FILENAME=".apm-pin"`, `SCHEMA_VERSION=1`, `write_marker`, `verify_marker`, `sync_markers_for_lockfile`, `CachePinError`
`src/apm_cli/install/phases/lockfile.py`	`LockfileBuilder.build_and_save()` calls `_sync_cache_pin_markers` unconditionally after `_write_if_changed` AND `_sync_cache_pin_markers_from_disk()` on the no-install early-return path (self-heals existing caches on next `apm install`)
`src/apm_cli/utils/normalization.py` (new)	Extracted normalization helpers (Build-ID strip, BOM strip, CRLF normalize) for reuse outside drift; re-exported from `drift.py` for back-compat
`src/apm_cli/install/services.py`	Added `scratch_root` kwarg to `integrate_package_primitives()` so replay can redirect deploy paths without monkey-patching
`src/apm_cli/utils/guards.py` (new)	`_assert_scratch_bound` defensive guard preventing replay from ever writing outside the scratch tree
`src/apm_cli/utils/diagnostics.py`	New `DRIFT` diagnostic category with friendly remediation text (`DRIFT_ORPHANED` clarifies it is a remediation hint)
`src/apm_cli/commands/audit.py`	`--no-drift` flag, mutex via `UsageError`, drift wired into both `_audit_ci_gate` and `_audit_content_scan`. Bare `apm audit` is advisory (exit 0); `apm audit --ci` is the gate (exit 1)
`src/apm_cli/policy/ci_checks.py`	Typed `_check_drift` signature; CacheMiss wording is drift-specific
`.github/workflows/ci.yml`	Annotated Gate B (legacy bash drift check) as redundant once `apm-action` ships with this CLI; kept as defense-in-depth fallback for one release cycle
`tests/integration/test_drift_check.py` (new, 32 tests)	9 drift cases, 4 past-PR regressions, 7 edges, 3 multi-target, 9 default-on/`--no-drift`/exit-codes
`tests/integration/test_drift_check_e2e.py` (new, 12 tests)	Full install→mutate→audit loop, no-write contract, air-gap, JSON/SARIF stability, ASCII-only output, bare-vs-CI exit-code regression, cache-pin marker E2E
`tests/unit/install/test_drift.py` (new, 19 tests)	Normalization (incl. 3 inverse-norm regressions) + CheckLogger stderr + replay correctness
`tests/unit/install/test_cache_pin.py` (new, 14 tests)	Every error path (missing/malformed/wrong-schema/missing-field/mismatched-commit), idempotent re-write, sync skip rules, self-heal regression
`tests/unit/install/test_drift_perf.py` (new, 1 test)	100 primitives smoke ceiling under 5s
Docs	`CHANGELOG.md`, `docs/.../guides/drift-detection.md`, `packages/apm-guide/.apm/skills/apm-usage/commands.md`, README link to drift guide

Diagrams

The drift pipeline at runtime — cache-only replay with marker verification, then diff:

flowchart LR
    A["apm audit"] --> B["Read apm.lock.yaml<br/>+ persistent cache"]
    B --> P["verify_marker():<br/>.apm-pin matches<br/>resolved_commit?"]
    P -->|"yes"| C["run_replay():<br/>install pipeline into<br/>scratch tmpdir"]
    P -->|"no"| X["CacheMissError:<br/>run apm install"]
    C --> D["diff_scratch_against_project():<br/>scratch vs project"]
    D --> E["DriftFinding[]"]
    E --> F["render_drift_text /<br/>render_drift_json /<br/>render_drift_sarif"]

How drift composes with the existing audit checks (drift is a separate rail, not a fake check):

stateDiagram-v2
    [*] --> AuditEntry
    AuditEntry --> CIGate: --ci
    AuditEntry --> ContentScan: bare or PKG
    CIGate --> Baseline7: 7 lockfile checks
    Baseline7 --> Drift: if --no-drift not set
    ContentScan --> Drift: if --no-drift not set
    Drift --> Render
    Baseline7 --> Render
    Render --> Exit1: any failure in --ci
    Render --> Exit0: bare audit (advisory)

Trade-offs

Default ON costs a few seconds. The cache-only replay is fast (100 primitives in ~1.2s on the perf test) but non-zero. Escape hatch: --no-drift. Rejected: opt-in default — would replicate the fix(compile): emit and clean up copilot root instructions (#792) #1067 silent-failure mode this PR exists to fix.
Mutex with --strip/--file is strict. A user could theoretically want strip-mode + drift, but the semantics ("drift of what?") are undefined. Refused at the CLI rather than silently picked. Rejected: warn-and-continue — invites bug reports we cannot answer.
Cache-pin is stale-cache detection, NOT integrity. A .apm-pin marker proves "the lockfile that produced this cache matches the lockfile we're auditing"; it does not defend against an active adversary with write access to apm_modules/. Content-addressed hashes / signatures are deferred to a follow-up. Documented in cache_pin.py module docstring.
Corrupt lockfile silently skips drift in bare apm audit mode. In --ci mode the lockfile-exists check fires first and fails loudly; bare mode no-ops drift. Pinned by tests; tracked as follow-up.
Multi-target replay reads apm.yml's target: field. Couples engine to manifest format, but directory auto-detection alone missed targets whose deployment dirs were still empty (false orphan reports). Caught during Phase D test development.

Benefits

One command catches the three real-world drift modes (unintegrated, modified, orphaned) that previously needed three separate scripts.
CI gates collapse from two steps to one. apm audit --ci subsumes both lockfile fidelity and integration drift; the bash git status --porcelain script is now annotated as redundant.
Read-only by contract. A dedicated e2e test snapshots mtime+sha256 of every file and asserts unchanged; the air-gap test runs the entire flow asserting no network subprocess.
JSON/SARIF integration unblocks tooling. Code-scanning, dashboards, and PR bots can consume drift findings without parsing text output.
Stale-cache failure mode closed. .apm-pin marker + verify_marker rejects a cache that doesn't match the lockfile; apm install self-heals existing caches without user intervention.

Validation

$ uv run --extra dev ruff check src/ tests/ && uv run --extra dev ruff format --check src/ tests/
All checks passed!
681 files already formatted

$ uv run --extra dev pytest tests/unit/install/test_drift.py \
                            tests/unit/install/test_cache_pin.py \
                            tests/unit/install/test_drift_perf.py \
                            tests/integration/test_drift_check.py \
                            tests/integration/test_drift_check_e2e.py
============================== 77 passed in 3.97s ==============================

$ uv run --extra dev pytest tests/unit -q
7598 passed, 1 warning, 30 subtests passed in 90.05s

Scenario evidence — what is tested to work

Every user-promise scenario this PR introduces is mapped to the test that proves it. Principle taxonomy per the scenario-evidence rubric.

#	Scenario (user promise)	Principle(s)	Test	Type
1	Forgot to re-run install after adding `.apm/` source — drift caught	Governed by policy, DevX	`test_drift_check.py::TestSectionADriftCases::test_a1_unintegrated_local_source_detected`	integration
2	Hand-edit to a regenerated `.github/` file — drift caught	Governed by policy, DevX	`test_drift_check.py::TestSectionADriftCases::test_a2_modified_deployed_file_detected`	integration
3	Source removed but deployed file lingers — drift caught	Governed by policy	`test_drift_check.py::TestSectionADriftCases::test_a3_orphaned_deployed_file_detected`	integration
4	Build-ID-only line difference does NOT trigger drift	DevX	`test_drift_check.py::TestSectionBNormalization::test_b1_build_id_only_diff_no_drift`	integration
5	CRLF-only line endings (Windows checkout) do NOT trigger drift	DevX, Multi-harness	`test_drift_check.py::TestSectionBNormalization::test_b3_crlf_only_diff_no_drift`	integration
6	UTF-8 BOM does NOT trigger drift	DevX	`test_drift_check.py::TestSectionBNormalization::test_b2_bom_only_diff_no_drift`	integration
7	Multi-target (copilot+claude+cursor) replay matches all installed targets	Portability by manifest, Multi-harness	`test_drift_check.py::TestSectionDMultiTarget::test_d1_multi_target_no_drift_when_clean`	integration
8	`--no-drift` opt-out skips drift entirely	DevX	`test_drift_check.py::TestSectionENoDriftFlag::test_e3_no_drift_flag_skips_replay`	integration
9	`--no-drift` mutex with `--strip` / `--file` rejects via UsageError	DevX	`test_drift_check.py::TestSectionENoDriftFlag::test_e5_no_drift_mutex_with_strip_rejected` / `test_e6_no_drift_mutex_with_file_rejected`	integration
10	`apm audit --ci` exits 1 on drift	Governed by policy	`test_drift_check.py::TestSectionENoDriftFlag::test_e1_drift_in_ci_mode_exits_one`	integration
11	Bare `apm audit` reports drift but exits 0 (advisory)	DevX, Governed by policy	`test_drift_check_e2e.py::TestDriftE2E::test_bare_audit_with_drift_exits_zero_but_ci_audit_exits_one`	e2e
12	JSON output schema stable: top-level `drift` key	OSS / community-driven	`test_drift_check_e2e.py::TestDriftE2E::test_audit_ci_json_payload_has_stable_top_level_keys`	e2e
13	SARIF output schema stable: `apm/drift/` rules	OSS / community-driven	`test_drift_check_e2e.py::TestDriftE2E::test_audit_ci_sarif_payload_is_valid_sarif`	e2e
14	No-write contract: audit never mutates project / lockfile / `apm_modules/`	Secure by default	`test_drift_check_e2e.py::TestDriftE2E::test_apm_audit_makes_no_writes_to_working_tree` + `test_apm_audit_makes_no_writes_when_drift_present`	e2e
15	Air-gap proof: drift replay does not touch the network	Secure by default	`test_drift_check_e2e.py::TestDriftE2E::test_audit_runs_without_network_subprocesses`	e2e
16	Full install → mutate → audit → reinstall → audit loop	DevX	`test_drift_check_e2e.py::TestDriftE2E::test_install_audit_tamper_audit_reinstall_audit_loop`	e2e
17	30s smoke: full audit on a real fixture under 30s	DevX	`test_drift_check_e2e.py::TestDriftE2E::test_audit_completes_within_smoke_budget`	e2e
18	Performance ceiling: 100-primitive replay+diff under 5s	DevX	`test_drift_perf.py::test_drift_replay_100_primitives_completes_within_budget`	unit
19	Text-mode drift section uses ASCII only (Windows cp1252 safe)	DevX, Multi-harness	`test_drift_check_e2e.py::TestDriftE2E::test_audit_text_mode_drift_section_uses_ascii_only`	e2e
20	`apm install` writes `.apm-pin` marker for each remote dep	Secure by default, Governed by policy	`test_drift_check_e2e.py::TestDriftE2E::test_apm_install_writes_cache_pin_marker_for_each_remote_dep`	e2e
21	Drift replay rejects cache whose `.apm-pin` doesn't match the lockfile commit	Secure by default	`test_cache_pin.py::test_verify_marker_raises_on_commit_mismatch` (unit) + composes with #20 (e2e)	unit + e2e

Scenario evidence — what is tested NOT to regress

Each row is a real bug or false-positive that escaped to production at some point in the past. The regression-trap test is the assertion that, had it existed before, would have caught the bug. No silent regression on any of these is permitted.

#	Past failure mode	Origin	Regression-trap test	Type
R1	`.apm/` source added under a non-default subdir (Cursor commands), `.cursor/commands/` never integrated	#1067	`test_drift_check.py::TestSectionCRegressions::test_c1_regression_pr_1067_unintegrated_subdir_source`	integration
R2	Auto-detected target set diverged from `apm.yml` `target:` field, falsely orphaning files	#882 design loop	`test_drift_check.py::TestSectionCRegressions::test_c2_regression_pr_882_target_autodetection_alignment`	integration
R3	Orphan cleanup pass removed governed files	#889	`test_drift_check.py::TestSectionCRegressions::test_c3_regression_pr_889_orphan_cleanup_does_not_touch_governed`	integration
R4	Source deleted but `apm install` not re-run, deploys stayed	observed in panel review	`test_drift_check.py::TestSectionCRegressions::test_c4_regression_source_deleted_no_install`	integration
R5	Build-ID line in compiled output flagged as drift on rebuild	observed during Phase D	`test_drift.py::test_normalize_strips_build_id_lines` + inverse `test_normalize_does_not_mask_real_drift_near_build_id`	unit
R6	BOM/CRLF in mixed-OS checkouts flagged as drift	observed during Phase D	`test_drift.py::test_normalize_strips__bom`, `_crlf` + inverse `test_normalize_does_not_mask_real_drift_*`	unit
R7	`apm audit --ci` did not gate on drift findings	Copilot review A1	`test_drift_check_e2e.py::TestDriftE2E::test_bare_audit_with_drift_exits_zero_but_ci_audit_exits_one`	e2e
R8	Bare `apm audit` exited non-zero on drift, breaking advisory contract	Copilot review A1	same as R7 (one test asserts both halves)	e2e
R9	Teammate bumped `apm.lock.yaml` but didn't re-run `apm install`; drift compared new lockfile against stale cache	apm-review-panel B1	`test_cache_pin.py` (14 tests covering verify, sync, self-heal) + `test_drift_check_e2e.py::test_apm_install_writes_cache_pin_marker_for_each_remote_dep`	unit + e2e
R10	Cache predates marker contract; user is gated forever with no recovery	self-heal requirement (B1)	`test_cache_pin.py::test_sync_markers_self_heals_caches_missing_marker`	unit
R11	Marker tampering / corruption silently accepted (5 paths: missing, malformed, non-object, unsupported schema, missing field)	B1 threat model	`test_cache_pin.py::test_verify_marker_raises_on_*` (5 tests)	unit
R12	JSON/SARIF stdout polluted by drift verbose logging	apm-review-panel B4	`test_drift.py::test_check_logger_scratch_root_emits_to_stderr_when_verbose` + `test_check_logger_scratch_root_silent_when_not_verbose`	unit

The CI integration script (scripts/test-integration.sh) was extended with two new pytest blocks in run_e2e_tests() so both integration suites run in the e2e CI job.

How to test

Pull this branch, run uv sync --all-extras --dev
Run the focused suite: uv run --extra dev pytest tests/unit/install/test_drift.py tests/unit/install/test_cache_pin.py tests/unit/install/test_drift_perf.py tests/integration/test_drift_check.py tests/integration/test_drift_check_e2e.py -v — expect 77 passed
In a real consumer repo (one with an apm.yml), edit any file under .github/instructions/ and run apm audit — expect a modified finding, exit 0
Run apm audit --no-drift — drift section absent, lockfile checks still run
Run apm audit --ci -f json — JSON has top-level drift key, parses cleanly, exit code is 1 if drift exists

Note

The .github/workflows/ci.yml legacy bash drift check is intentionally kept for one release cycle as defense-in-depth until apm-action picks up a CLI version that includes this PR.

Closes #1071. Closes #898.

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

- Add _ReadOnlyProjectGuard context manager (utils/guards.py): snapshots stat of protected paths, raises ProtectedPathMutationError on any mutation. Defense-in-depth above the scratch-root remap. - Add CATEGORY_DRIFT + drift() recording method to DiagnosticCollector. - Add drift_count property and _render_drift_group renderer that groups by kind (modified/unintegrated/orphaned) with stable section header for machine consumers. - Tests: 7 unit tests covering happy path, mutation, creation, deletion, missing-tolerated, exception-not-masked, single-file protected path. Refs #1071. Phase A of WIP/drift/06-final-plan.md. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Implements the drift detection feature per WIP/drift/06-final-plan.md (closes #1071 scope alignment with #898). Engine (Phase B): - src/apm_cli/install/drift.py: ReplayConfig, DriftFinding, CheckLogger, CacheMissError, normalization helpers (build-id strip, line endings, BOM), run_replay() (cache-only), diff_scratch_against_project(), text/json/sarif renderers, atexit scratch cleanup. - src/apm_cli/install/services.py: scratch_root kwarg with ensure_path_within defense-in-depth guard for replay isolation. - src/apm_cli/policy/ci_checks.py: _check_drift() wrapper returning (CheckResult, list[DriftFinding]); graceful CacheMissError handling. CLI surface (Phase C): - src/apm_cli/commands/audit.py: --no-drift opt-out flag with mutex against --strip/--file via UsageError. Drift wired into both _audit_ci_gate (--ci) and _audit_content_scan (bare project audit) paths, default-on per ADR-02. JSON/SARIF/text renderers integrated; --no-drift warning gated to text mode (stdout cleanliness). Tests: - tests/unit/install/test_drift.py: 13 unit tests (normalization, diff cases, renderers). - Legacy --ci tests opt out of drift via batch --no-drift injection (fixture parity, not a behavior change). 7597 unit tests pass; lint clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Implements the locked test matrix for issue #1071 drift detection. Floor of 43 tests across three new files closes the 'ULTRA HARDENING OF HELL' coverage requirement. New files: - tests/integration/test_drift_check.py (32 tests): * Section A: 9 drift cases (modified/unintegrated/orphaned + CRLF/ BOM/Build-ID false-positive guards) * Section B: 4 past-PR regressions (#1067, #882, #889, source-deleted) * Section C: 7 edges (no/corrupt lockfile, untracked governed, no-write contract, idempotency) * Section D: 3 multi-target (copilot/claude/cursor) * Section E: 9 default-on / --no-drift opt-out (mutex, stderr routing, JSON suppression) - tests/integration/test_drift_check_e2e.py (10 tests): full install->mutate->audit loop with mix_stderr=False, air-gap proof, JSON/SARIF stability, 30s smoke - tests/unit/install/test_drift_perf.py (1 test): 100 primitives replay+diff under 5s Engine fix surfaced by tests: - src/apm_cli/install/drift.py: run_replay now reads apm.yml's target field via parse_target_field and passes it to resolve_targets. Without this, multi-target projects (copilot+claude+cursor) replayed only the auto-detected primary target, falsely reporting secondary target deployments as orphaned. Helper _read_apm_yml_target() added. CI wiring: - scripts/test-integration.sh: two new blocks in run_e2e_tests() invoking the integration + e2e suites before the final success log. Both safe to run without GITHUB_APM_PAT (cache-only, mocked network). Verification: 56 drift-domain tests pass; full repo lint clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- CHANGELOG.md: Added [Unreleased] entry under Added describing the default-on drift detection in apm audit, the three failure modes it catches, false-positive guards, --no-drift opt-out + mutex semantics, and the JSON/SARIF integration shape. Closes #1071, supersedes #898. - docs/src/content/docs/guides/drift-detection.md (NEW, sidebar order 7): Full user-facing guide -- what drift means, how the cache-only replay works (with mermaid diagram), exit-code matrix, when to use --no-drift, output formats, and the CI single-line gate that replaces the legacy git status --porcelain script. - packages/apm-guide/.apm/skills/apm-usage/commands.md: Extended the audit row with --no-drift flag and added a paragraph documenting the drift-by-default behavior, three failure modes, false-positive normalization, and JSON/SARIF integration. Aligns the skill that ships in apm-guide with the new CLI surface (per apm-keep-docs-up-to-date.instructions.md rule 4). - .github/workflows/ci.yml: Annotated Gate B (legacy bash drift check) with a comment marking it redundant once apm-action ships a CLI with default-on drift detection (this PR's release). Kept as defense-in-depth fallback until then. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

danielmeppiel · 2026-05-04T21:51:20Z

APM Review Panel: `ship_with_followups`

Default-on drift detection is a strong positioning move; ship after adding the recovery hint and syncing stale docs -- zero blockers, clean severity discipline.

The panel converged cleanly: zero blocking findings across eight specialists, strong approval of the replay-engine architecture, and consistent praise for the read-only/cache-only safety boundary. The highest-signal finding came from two independent personas (cli-logging-expert AND devx-ux-expert) identifying the same gap: drift output tells users WHAT drifted but never HOW to fix it. This violates APM's 'Include the fix' messaging rule and npm-audit's precedent. A one-line append ('Run: apm install') is a trivial fix that should land in this PR before merge -- it's the difference between a diagnostic and a product.

The supply-chain-security-expert raised a legitimate concern: replay trusts cache content without verifying it matches lockfile resolved_commit. The current defense (NotImplementedError on network replay + cache populated only by prior verified installs) is pragmatic for v1, but shared CI caches and mounted volumes are a real threat vector. This warrants a tracked follow-up issue, not a gate. The test-coverage-expert's missing inverse-normalization test (outcome: missing, unit tier) is worth noting: the safety invariant 'normalization does NOT mask real drift' has no automated guard. This is a regression-trap gap on a secure-by-default surface -- elevate above opinion-tier nits.

Doc-sync debt is the other consistent theme: cli-commands.md, ci-cd.md (still recommending the bash workaround this PR supersedes), and quick-start.md all need updates. The doc-writer and devx-ux-expert converge here. These are cheap fixes that should land in-PR to avoid shipping a feature whose docs contradict themselves on day one.

Dissent. No substantive dissent. Python-architect's scratch_root seam concern and supply-chain's atexit fragility point at the same underlying trade-off (temp-dir lifecycle); both agree it's recommended-tier, not blocking. Auth-expert correctly self-deactivated -- no auth surface touched.

Aligned with: secure-by-default (default-on, cache-only, no credential surface), governed-by-policy (SARIF + --ci exit code), portable-by-manifest (lockfile resolved_commit as source of truth), pragmatic-as-npm (one command, zero config, opt-out via --no-drift).

Growth signal. Drift detection opens a new pillar under 'lockfile-grade reproducibility'. The launch beat writes itself: 'apm audit catches what your CI bash script misses.' Before/after story (4 lines of fragile bash vs. one command with SARIF for free) is strong social-thread material. Ensure cross-links land so the guide isn't an orphan page with zero internal-link equity.

Panel summary

Persona	R	N	Takeaway
Python Architect	3	2	Solid replay-engine design with frozen dataclasses and defense-in-depth guards; scratch_root kwarg is a pragmatic seam but warrants a follow-up to formalize as a deploy-target factory.
CLI Logging Expert	1	3	Drift output follows ASCII/stderr conventions well; missing actionable fix hint violates 'Include the fix' rule; verbose scratch-path disclosure absent.
DevX UX Expert	2	2	Missing recovery hint in drift output and missing cli-commands.md update; otherwise the surface design is sound.
Supply Chain Security Expert	3	2	Cache-integrity verification absent from replay path; scratch containment is sound; air-gap test blocks subprocesses but not socket-level egress; atexit cleanup is SIGTERM-fragile on shared CI runners.
OSS Growth Hacker	3	2	Strong launch-beat story; needs cross-links and a 60-second demo path to avoid the new guide being an island.
Doc Writer	5	2	Drift guide is well-written; ship after fixing two stale companion docs and one self-contradictory section. CHANGELOG bullet violates Keep-a-Changelog one-line rule.
Test Coverage Expert	2	0	42 integration+e2e tests cover all three drift kinds, no-write contract, air-gap, SARIF/JSON stability, and --no-drift mutex. Two gaps: no test that normalization guards do NOT mask real drift, and CacheMissError path untested at unit tier.

B = blocking-severity findings, R = recommended, N = nits.
Counts are signal strength, not gates. The maintainer ships.

Top 5 follow-ups

[DevX UX Expert] Add recovery hint to drift output: 'Run: apm install' after each drift group -- Two-persona convergence (cli-logging + devx-ux). Violates 'Include the fix' rule. One-line fix that converts a diagnostic into a product moment. Should land in-PR.
[Doc Writer] Update cli-commands.md (--no-drift flag), ci-cd.md (remove obsolete bash workaround), and quick-start.md (stale cross-ref) -- Three stale doc surfaces contradict the new behavior. Shipping without sync means day-one user confusion. Cheap fixes, same PR.
[Supply Chain Security Expert] Track follow-up issue: verify cache content matches lockfile resolved_commit before replay trusts it -- Shared CI caches and mounted volumes are a real poisoning vector. Current defense (prior-install trust) is pragmatic for v1 but the integrity primitive (commit-SHA pinning) is bypassed at replay time.
[Test Coverage Expert] Add inverse-normalization unit test asserting real content drift is NOT suppressed by BOM/CRLF/Build-ID guards -- Evidence outcome=missing on a secure-by-default safety invariant. A normalization bug could silently mask real drift with no automated guard catching the regression.
[OSS Growth Hacker] Add inbound cross-links from ci-policy-setup, governance-guide, and making-the-case to the new drift-detection guide -- Guide currently has zero internal-link equity -- it's discoverable only via sidebar. Cross-links from high-traffic pages drive adoption of the new capability.

Architecture

classDiagram
    direction LR
    class ReplayConfig {
        <<ValueObject>>
        +project_root Path
        +lockfile_path Path
        +targets frozenset~str~ | None
        +cache_only bool
    }
    class DriftFinding {
        <<ValueObject>>
        +path str
        +kind str
        +package str
        +inline_diff str
    }
    class CacheMissError {
        <<Exception>>
    }
    class CheckLogger {
        <<Base+Subclass>>
        +replay_start()
        +diff_start()
        +replay_complete(n)
        +clean()
        +findings(n)
    }
    class CommandLogger {
        <<Base>>
        +verbose bool
        +error(msg)
        +warning(msg)
        +success(msg)
    }
    class _ReadOnlyProjectGuard {
        <<ContextManager>>
        +project_root Path
        +protected_roots list~Path~
        +__enter__()
        +__exit__()
    }
    class ProtectedPathMutationError {
        <<Exception>>
    }
    class DiagnosticCollector {
        <<Collect-then-render>>
        +drift(msg, path)
        +drift_count int
    }
    class integrate_package_primitives {
        <<IOBoundary>>
        +scratch_root Path | None
    }
    CheckLogger --|> CommandLogger : extends
    _ReadOnlyProjectGuard ..> ProtectedPathMutationError : raises
    class ReplayConfig:::touched
    class DriftFinding:::touched
    class CacheMissError:::touched
    class CheckLogger:::touched
    class _ReadOnlyProjectGuard:::touched
    class ProtectedPathMutationError:::touched
    class integrate_package_primitives:::touched
    classDef touched fill:#fff3b0,stroke:#d47600

flowchart TD
    A["apm audit (commands/audit.py)"] --> B{--ci flag?}
    B -- yes --> C["_audit_ci_gate()"]
    B -- no --> D["_audit_content_scan()"]
    C --> E{"--no-drift?"}
    D --> F{"--no-drift AND no --file/--strip/--package?"}
    E -- no --> G["[NET] _check_drift(project_root, lockfile)"]
    F -- no --> G
    E -- yes --> H["skip drift, warn on stderr"]
    F -- yes --> H
    G --> I["run_replay(ReplayConfig, CheckLogger)"]
    I --> J["[FS] _make_scratch_root() tempfile.mkdtemp + atexit"]
    J --> K["_assert_scratch_bound(project, scratch)"]
    K --> L{"for lock_dep in lockfile"}
    L --> M["[FS] _materialize_install_path() resolve cache path"]
    M --> N["_build_package_info() loads apm.yml if present"]
    N --> O["[FS] integrate_package_primitives(scratch_root=scratch_root)"]
    O --> P["[FS] ensure_path_within(project_root, scratch_root)"]
    P --> L
    L -- done --> Q["diff_scratch_against_project()"]
    Q --> R["[FS] _walk_managed(scratch, governed_roots)"]
    Q --> S["[FS] _walk_managed(project, governed_roots)"]
    R --> T["_normalize() on each file pair (BOM + CRLF + Build-ID strip)"]
    S --> T
    T --> U["emit list~DriftFinding~"]
    U --> V{"format?"}
    V -- text --> W["render_drift_text()"]
    V -- json --> X["render_drift_json()"]
    V -- sarif --> Y["render_drift_sarif()"]

sequenceDiagram
    participant User
    participant AuditCmd as commands/audit.py
    participant CIChecks as policy/ci_checks.py
    participant Drift as install/drift.py
    participant Services as install/services.py
    participant FS as Filesystem

    User->>AuditCmd: apm audit [--ci]
    AuditCmd->>CIChecks: _check_drift(project_root, lockfile)
    CIChecks->>Drift: run_replay(ReplayConfig, CheckLogger)
    Drift->>FS: mkdtemp (scratch dir)
    Drift->>Drift: _assert_scratch_bound(project, scratch)
    loop each locked dependency
        Drift->>FS: _materialize_install_path (cache lookup)
        Drift->>Services: integrate_package_primitives(scratch_root=scratch)
        Services->>Services: ensure_path_within(project_root, scratch_root)
        Services->>FS: write primitives to scratch
    end
    CIChecks->>Drift: diff_scratch_against_project(scratch, project, lockfile)
    Drift->>FS: _walk_managed(scratch) + _walk_managed(project)
    Drift->>Drift: _normalize() each pair
    Drift-->>CIChecks: list[DriftFinding]
    CIChecks-->>AuditCmd: (CheckResult, findings)
    AuditCmd->>User: render_drift(findings, format)

Recommendation

Land the recovery hint and doc-sync fixes in this PR (both are trivial, same-day work), then merge. Track cache-integrity verification and inverse-normalization test as follow-up issues. The feature is architecturally sound, the test suite is strong (43 new tests), and the positioning value is high. No reason to delay beyond the in-PR polish pass.

Full per-persona findings

Python Architect

[recommended] scratch_root kwarg on integrate_package_primitives leaks test-double / audit concerns into the production API surface at src/apm_cli/install/services.py:96
Adding scratch_root to the production install function's signature couples a drift-audit implementation detail into the core pipeline. Cleaner seam: DeployTarget protocol injected by callers.
[recommended] Normalization helpers are drift.py-private but will be needed by compile, install verify, and future audit modes at src/apm_cli/install/drift.py:96
_strip_build_id, _normalize_line_endings, _strip_bom belong in utils/normalization.py with a public API.
[recommended] _read_apm_yml_target couples drift module to manifest format at src/apm_cli/install/drift.py:334
Tight coupling to apm.yml schema; pragmatic today, but a 'load project config' helper is the right seam.
[nit] render_drift_json return type annotation could be more precise
TypedDict or dict[str, list[dict[str, str]]] gives downstream callers type-safe access.
[nit] atexit.register may leave temp dirs on SIGKILL
Use tempfile.TemporaryDirectory context manager pattern for both atexit AND explicit cleanup.

CLI Logging Expert

[recommended] render_drift_text and _render_drift_group omit an actionable remediation hint telling the user how to fix detected drift at src/apm_cli/install/drift.py:640
Message Writing Rule Add ARM64 Linux support to CI/CD pipeline #4 'Include the fix'. Drift findings show WHAT but never HOW. Append 'Run apm install to re-sync deployed files with the lockfile.'
[nit] CheckLogger emits stderr progress unconditionally -- no quiet-mode or non-TTY suppression path
[nit] _inline_diff_for never produces an actual diff for sub-cap files; docstring is misleading
[nit] Verbose mode discloses tracemalloc peak but not scratch tmpdir path -- agent debugging is harder

DevX UX Expert

[recommended] render_drift_text never tells the user HOW to fix drift -- the single remediation command is missing at src/apm_cli/install/drift.py
Failure mode is the product. npm audit prints 'run npm audit fix'; APM should print 'Run: apm install'.
[recommended] docs/src/content/docs/reference/cli-commands.md not updated -- PR is incomplete per doc-sync Rule 4 at docs/src/content/docs/reference/cli-commands.md
Persona scope: 'If a CLI change is not reflected in cli-commands.md in the same PR, that change is incomplete by definition.' --no-drift not in flags table.
[nit] --no-drift mutex error could name the reason more concretely
[nit] CHANGELOG entry is a single monolithic paragraph -- harder to scan than bullet points

Supply Chain Security Expert

[recommended] Drift replay trusts apm_modules cache without verifying content matches lockfile resolved_commit at src/apm_cli/install/drift.py:231
_materialize_install_path checks only EXISTS, never that content/git HEAD matches lockfile.resolved_commit. A poisoned cache (CI runners with shared caches, mounted volumes) would produce false-clean. Security model says commit-SHA pinning is the integrity primitive; replay bypasses it.
[recommended] Air-gap test blocks subprocess binaries (gh/curl/wget) but not socket-level Python network calls at tests/integration/test_drift_check_e2e.py:84
Patches subprocess.run/Popen but not socket.socket. Future requests/urllib egress would not be caught.
[recommended] atexit-based scratch cleanup is not SIGTERM/SIGKILL safe; sensitive deployed content may persist on shared CI runners at src/apm_cli/install/drift.py:142
Python atexit doesn't fire on SIGKILL; CI cancellation = SIGTERM+SIGKILL. Scratch /tmp persists with replayed primitives.
[nit] inline_diff field could leak secrets if expanded -- add SECURITY comment
[nit] Positive finding: drift replay correctly avoids credential paths (zero token references; NotImplementedError on cache_only=False)

OSS Growth Hacker

[recommended] Drift-detection guide is an island -- no inbound links from ci-policy-setup, governance-guide, or making-the-case
Guide links OUT but nothing links IN. Zero internal-link equity.
[recommended] CHANGELOG entry buries the hook in implementation detail -- rewrite lead sentence as the one-liner users can repost
[recommended] No 60-second demo path for adopters -- the guide explains but never invites the reader to try
[nit] README 'Content security' bullet mentions audit but not drift -- missed hook
[nit] Guide does not name the comparable ecosystem tools (npm audit, terraform plan -detailed-exitcode)

Auth Expert -- inactive

PR does not touch any auth-relevant file; drift replay is cache-only with no downloader/clone/token surface -- verified via grep + drift.py read. Network-enabled replay is explicitly gated by NotImplementedError.

Doc Writer

[recommended] integrations/ci-cd.md still documents the bash git-status workaround that this PR explicitly supersedes at docs/src/content/docs/integrations/ci-cd.md:61
Lines 61-79 still recommend the legacy pattern under 'Verify Deployed Primitives'; 'We dogfood this' callout still cites the old ci.yml. Most important doc-sync gap on the PR.
[recommended] reference/cli-commands.md does not document --no-drift and still describes --ci as 7-check-only at docs/src/content/docs/reference/cli-commands.md:462
[recommended] quick-start.md sends new users to the now-obsolete drift section in ci-cd.md at docs/src/content/docs/getting-started/quick-start.md:159
[recommended] drift-detection.md case Integrate copilot runtime #2 in 'When to use --no-drift' contradicts the CLI behavior described two paragraphs later at docs/src/content/docs/guides/drift-detection.md:67
Lines 67-69 list 'Strip-mode invocations' as a reason to use --no-drift, but audit.py auto-skips drift in strip/file mode AND raises UsageError if combined. Self-contradiction.
[recommended] CHANGELOG entry violates Keep-a-Changelog 'one line per PR' rule
[nit] New page is not back-linked from ci-policy-setup.md or governance-guide.md
[nit] sidebar order: 7 collides with dev-only-primitives.md and org-packages.md

Test Coverage Expert

[recommended] No test asserts normalization guards (BOM/CRLF/Build-ID) do NOT mask real content drift at tests/unit/install/test_drift.py
Tests prove cosmetic-only changes are suppressed. None asserts the inverse -- the safety invariant. A future normalization bug could silently mask real drift.
Proof (missing at): tests/unit/install/test_drift.py -- proves: Normalization guards suppress only cosmetic changes, never real drift [secure-by-default,governed-by-policy]
[recommended] CacheMissError raised by run_replay has no direct unit test exercising the raise paths at tests/unit/install/test_drift.py
Defined and raised in 5 locations; only command-layer swallowing is tested. No direct pytest.raises assertion.
Proof (missing at): tests/unit/install/test_drift.py -- proves: Cache-miss paths in drift replay raise CacheMissError so the audit command can catch and handle gracefully [devx,governed-by-policy]

_{This panel is advisory. It does not block merge. Re-apply the

panel-review label after addressing feedback to re-run.}

CEO panel recommended landing two in-PR follow-ups before merge: 1. Recovery hint in drift output (cli-logging + devx-ux convergence): render_drift_text now appends '[i] Run apm install to re-sync deployed files with the lockfile.' so users see WHAT and HOW in one message. Honors Message Writing Rule #4 'Include the fix'. 2. Doc-sync (doc-writer + devx-ux convergence): - reference/cli-commands.md: add --no-drift to audit options table; amend --ci description to mention drift contribution. - integrations/ci-cd.md: replace bash 'git status --porcelain' workaround under 'Verify Deployed Primitives' with 'apm audit --ci' one-liner; update 'We dogfood this' callout text. - getting-started/quick-start.md: retarget stale cross-ref from the now-superseded ci-cd anchor to the new drift-detection guide. - guides/drift-detection.md: drop the self-contradictory case #2 in 'When to use --no-drift' (strip-mode is auto-skipped, not opt-out). - CHANGELOG.md: compress verbose entry to one Keep-a-Changelog line pointing readers to the guide for detail. Tracked as follow-up issues (CEO call): - supply-chain: verify cache content matches lockfile resolved_commit before drift replay trusts it (commit-SHA pinning bypass on shared CI caches). - test-coverage: inverse-normalization unit test asserting BOM/CRLF/ Build-ID guards do NOT mask real content drift (safety invariant). Lint clean. 45 drift tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

danielmeppiel · 2026-05-04T21:54:04Z

Addressed in-PR (commit `7907865`)

Per the CEO panel recommendation, landed the two convergent in-PR follow-ups:

1. Recovery hint (cli-logging + devx-ux):
render_drift_text now appends [i] Run 'apm install' to re-sync deployed files with the lockfile. -- the diagnostic now tells users WHAT drifted AND HOW to fix it in one message. Honors Message Writing Rule #4 ('Include the fix').

2. Doc-sync (doc-writer + devx-ux):

reference/cli-commands.md -- added --no-drift to audit options; amended --ci to mention drift contribution.
integrations/ci-cd.md -- replaced the now-superseded bash git status --porcelain workaround with apm audit --ci; updated 'We dogfood this' callout.
getting-started/quick-start.md -- retargeted stale cross-ref to the new drift-detection guide.
guides/drift-detection.md -- removed the self-contradictory case Integrate copilot runtime #2 in 'When to use --no-drift' (strip-mode is auto-skipped, not opt-out).
CHANGELOG.md -- compressed to one Keep-a-Changelog line pointing readers to the guide.

All 45 drift tests pass; lint clean.

Tracked as follow-up issues (CEO call)

supply-chain-security: verify cache content matches lockfile resolved_commit before drift replay trusts it (commit-SHA pinning bypass on shared CI caches with mounted volumes). Will file as a separate issue.
test-coverage: inverse-normalization unit test asserting BOM/CRLF/Build-ID guards do NOT mask real content drift (safety invariant on a secure-by-default surface). Will file as a separate issue.
growth: inbound cross-links from ci-policy-setup, governance-guide, making-the-case to the drift-detection guide (internal-link equity). Cheap follow-up PR.
architecture: scratch_root kwarg formalization as a DeployTarget protocol; lift normalization helpers to utils/normalization.py for sharing with compile/verify pipelines.
logging: TTY/--quiet awareness in CheckLogger; verbose scratch-path disclosure.

Status

PR is ready for review. CEO stance: ship_with_followups.

Copilot

Pull request overview

This PR extends apm audit to perform integration drift detection by default by replaying the install/integration pipeline into a scratch directory (cache-only) and diffing the result against the working tree. It adds a new drift engine, wires drift into text/JSON/SARIF outputs, updates CI/docs, and introduces extensive unit/integration/e2e coverage.

Changes:

Add a cache-only “install replay + diff” drift engine with renderers (text/JSON/SARIF) and normalization guards (Build ID / CRLF / BOM).
Wire drift detection into apm audit (default-on) with --no-drift opt-out, and propagate findings into CI JSON + SARIF payloads.
Add read-only defensive guard + substantial test coverage; update CI scripts and docs to reflect the new drift surface.

Show a summary per file

File	Description
uv.lock	Bumps apm-cli version to 0.12.1.
src/apm_cli/install/drift.py	New drift replay + diff engine with normalization + renderers.
src/apm_cli/install/services.py	Adds `scratch_root` kwarg + safety guard for replayed integrations.
src/apm_cli/utils/guards.py	New read-only project-tree guard used as defense-in-depth for drift.
src/apm_cli/utils/diagnostics.py	Adds drift diagnostic category + rendering.
src/apm_cli/policy/ci_checks.py	Adds `_check_drift()` to run replay+diff and return findings.
src/apm_cli/commands/audit.py	Adds `--no-drift`, runs drift by default, integrates drift into JSON/SARIF/text output paths.
tests/unit/utils/test_guards.py	Unit coverage for the read-only guard behavior (create/modify/delete).
tests/unit/utils/init.py	Ensures `tests.unit.utils` is a package for imports/discovery.
tests/unit/test_audit_policy_command.py	Updates audit policy unit tests to disable drift (stabilize expectations).
tests/unit/test_audit_ci_command.py	Updates CI-mode audit unit tests to disable drift (stabilize expectations).
tests/unit/test_audit_ci_auto_discovery.py	Updates auto-discovery unit tests to disable drift.
tests/unit/install/test_drift.py	Unit tests for drift normalization, diff kinds, SARIF rule IDs, and stderr-only phase logging.
tests/unit/install/test_drift_perf.py	Performance smoke test for replay+diff under a 5s budget for 100 primitives.
tests/integration/test_drift_check.py	Integration tests covering drift kinds, edges, multi-target behavior, and `--no-drift` mutex.
tests/integration/test_drift_check_e2e.py	E2E tests for no-write contract, air-gap behavior, performance smoke, and JSON/SARIF shape stability.
scripts/test-integration.sh	Ensures new drift integration/e2e suites run in CI integration job.
packages/apm-guide/.apm/skills/apm-usage/commands.md	Updates apm-guide command reference for drift default + `--no-drift`.
docs/src/content/docs/reference/cli-commands.md	Documents `--no-drift` and drift behavior in `apm audit --ci`.
docs/src/content/docs/integrations/ci-cd.md	Replaces legacy `git status` drift gate with `apm audit --ci`.
docs/src/content/docs/guides/drift-detection.md	New/updated guide describing drift behavior, formats, and CI usage.
docs/src/content/docs/getting-started/quick-start.md	Updates link/reference to drift checking guidance.
CHANGELOG.md	Adds Unreleased entry for default-on drift detection.
.github/workflows/ci.yml	Notes legacy bash drift step as redundant once apm-action picks up this CLI version.

Copilot's findings

Comments suppressed due to low confidence (2)

src/apm_cli/policy/ci_checks.py:470

_check_drift() resolves targets via resolve_targets(project_root) (auto-detect). If a project declares target: in apm.yml but the corresponding deployment dirs don't exist yet, auto-detect can omit those targets and the diff will miss drift (e.g., unintegrated outputs under .claude/, .cursor/, etc.). Consider resolving targets with the same explicit target selection used by run_replay() (read from apm.yml) so governed roots are complete even when dirs are absent.

    logger.diff_start()
    resolved_targets = resolve_targets(project_root)
    if targets:
        resolved_targets = [t for t in resolved_targets if t.name in set(targets)]
    findings = diff_scratch_against_project(scratch, project_root, lockfile, resolved_targets)

docs/src/content/docs/guides/drift-detection.md:94

The JSON example shows drift as a list, but the implementation/tests treat the top-level drift key as an object containing a drift list (i.e., payload['drift']['drift']). Please update this example (or adjust the implementation) so the documented schema matches the actual output shape.

**JSON** -- the audit report gains a top-level `drift` key:

```json
{
  "report_format_version": "1.0",
  "checks": [...],
  "drift": [
    {
      "path": ".github/instructions/foo.md",
      "kind": "modified",
      "package": "<local>",
      "inline_diff": "..."
    }
  ]
}

</details>


- **Files reviewed:** 22/24 changed files
- **Comments generated:** 8

+def _snapshot(paths: list[Path]) -> dict[Path, tuple[float, int] | None]:
+    """Capture (mtime_ns, size) for each path, or ``None`` if missing.
+
+    Symlinks are followed; missing paths record ``None`` so they may
+    legitimately remain absent without triggering the guard.
+    """
+    snap: dict[Path, tuple[float, int] | None] = {}
+    for p in paths:
+        try:
+            st = p.stat()
+            snap[p] = (st.st_mtime_ns, st.st_size)
+        except FileNotFoundError:


+                passed=False,
+                message=(
+                    f"drift replay aborted: {exc} -- "
+                    "run 'apm install' first or use --no-cache (not yet supported)"


+def _check_drift(
+    project_root: Path,
+    lockfile,
+    targets=None,
+    cache_only: bool = True,
+    verbose: bool = False,
+) -> tuple[CheckResult, list]:
+    """Replay the install in a scratch dir and diff against the project.


+        for stream_name, stream in (("stdout", result.stdout), ("stderr", result.stderr)):
+            text = stream or ""
+            for ch in text:
+                assert ord(ch) < 128 or ch in {"\u2500", "\u2501"} or ord(ch) > 0xFFFF or True, (


+# Drift severities: kinds of divergence from the lockfile-defined state.
+DRIFT_MODIFIED = "modified"  # tracked file content changed
+DRIFT_UNINTEGRATED = "unintegrated"  # tracked file missing from project
+DRIFT_ORPHANED = "orphaned"  # untracked file present in managed dir



+### Added
+
+- **`apm audit` now detects integration drift by default.** Read-only cache-only install replay catches missed `apm install` runs, hand-edited deployed files, and orphaned files. Findings exposed in JSON (`drift` key) and SARIF (`apm/drift/<kind>` rules); in `--ci` mode they contribute to the exit code. Opt out with `--no-drift` (mutually exclusive with `--strip`/`--file`). See the [Drift Detection guide](docs/src/content/docs/guides/drift-detection.md) for details. (#1071, supersedes scope of #898)


+| `apm audit` | Reported in stdout | 0 (advisory only) |
+| `apm audit --ci` | Reported and counted as failure | 1 |
+| `apm audit --no-drift` | Skipped entirely | governed only by other checks |
+
+In `--ci` mode drift findings are pooled with the seven baseline lockfile
+checks (`lockfile-exists`, `ref-consistency`, etc.) -- a single
+non-zero exit covers all of them.


@@ -593,6 +670,10 @@ def _audit_content_scan(
        all_findings = [f for ff in findings_by_file.values() for f in ff]
        exit_code = 1 if ContentScanner.has_critical(all_findings) else 2

+    # Drift findings escalate exit code to 1 (critical).
+    if drift_failed and exit_code == 0:
+        exit_code = 1
+


…gnostics Bare 'apm audit' is advisory (exit 0 on drift); 'apm audit --ci' is the gate (exit 1). Closes the regression introduced when content-scan escalation accidentally also escalated drift findings. Also addresses inline review: - A2: vacuous ASCII-encoding assertion now scopes per-line - A4: tuple[float, int] -> tuple[int, int] in guards.py - A5: type-annotated _check_drift signature - A6: clarified DRIFT_ORPHANED comment - A7: CHANGELOG references PR + closes - A3: CacheMiss message now drift-specific (no --no-cache confusion) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Per oss-growth: surfaces drift detection alongside content security and lockfile integrity in the conversion-critical Production-grade section, so a reader scanning for 'why APM' sees the supply-chain story end-to-end. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

apm install drops a .apm-pin JSON marker into each cached package root recording the resolved_commit; apm audit verifies it before running drift replay. Catches the 'teammate bumped lockfile, did not reinstall' + 'shared CI runner reused stale apm_modules' scenarios that would otherwise silently produce misleading drift output. LockfileBuilder syncs markers UNCONDITIONALLY (even when the lockfile YAML is unchanged and even when no install happens), so existing users self-heal on their next 'apm install'. This is stale-cache detection, NOT cryptographic integrity -- defending against active cache tampering requires content-addressed hashes, which is deferred. Schema (v1): {schema_version: 1, resolved_commit: <sha>} Marker file: <install_path>/.apm-pin Coverage: - 14 unit tests in test_cache_pin.py (positive + every error path + skip rules + idempotent re-run + self-heal regression) - 1 integration test in test_drift_check_e2e.py exercising the full install -> mark -> verify flow against a synthetic cache Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

danielmeppiel · 2026-05-05T07:55:42Z

Follow-ups landed (3 commits on top of `feat/drift-detection`)

Addressed every Copilot inline review comment plus the panel-deferred follow-ups (supply-chain hardening, growth surface, additional test coverage, logging refinement) in the same PR.

Commit map

SHA	Scope
`582e079`	Copilot review fixes (A1-A7) -- exit-code contract + types + diagnostics + B3 normalization extract + B4 logging
`9e3354a`	README link to drift-detection guide (B5)
`978765d`	Cache-pin marker for stale-cache detection (B1)

Cache-pin marker contract (B1, supply-chain follow-up)

apm install drops a .apm-pin JSON marker ({schema_version: 1, resolved_commit: <sha>}) into each cached package root. apm audit verifies it before drift replay. Catches the two real-world hazards the panel called out:

Teammate bumps apm.lock.yaml (e.g. pin X -> Y) without re-running apm install -> drift would otherwise diff new lockfile against stale cache and report misleading findings.
Shared CI runner reuses apm_modules/ across builds where the lockfile changed between jobs.

LockfileBuilder syncs markers unconditionally (even when the lockfile YAML is unchanged AND when no install happens), so existing users self-heal on their next apm install. Threat model is documented in cache_pin.py -- this is stale-cache detection, NOT cryptographic integrity. Active-tampering defense (content-addressed hashes / signatures) is deferred.

Proof of integration test coverage

76 drift tests (was 74 prior to follow-ups; +14 unit cache_pin + new integration test):

File	Count	Scenarios
`tests/unit/install/test_drift.py`	19	Normalization (Build-ID, BOM, CRLF) + 3 inverse-norm regressions + CheckLogger stderr (incl. 2 new `scratch_root` verbose-gating tests) + replay correctness
`tests/unit/install/test_cache_pin.py`	14	NEW -- positive verify, missing marker, malformed JSON, non-object payload, unsupported schema_version, missing `resolved_commit` field, commit mismatch, idempotent re-write, missing-dir tolerance, sync skips local deps, sync skips no-commit deps, sync skips no-cache deps, self-heal regression
`tests/integration/test_drift_check.py`	32	9 drift cases + 4 panel-feedback regressions + 12 edges + 5 multi-target + 5 default-behavior/--no-drift
`tests/integration/test_drift_check_e2e.py`	11	E2E install-audit-tamper-reinstall loop + JSON/SARIF stable shape + ASCII-only output + bare-vs-CI exit-code regression + NEW cache-pin marker E2E (synthetic remote-style lockfile)

uv run pytest tests/unit/install/test_drift.py \
              tests/unit/install/test_cache_pin.py \
              tests/integration/test_drift_check.py \
              tests/integration/test_drift_check_e2e.py
============== 76 passed in 3.71s ==============

Plus the broader local slice ran clean: tests/unit/install/ tests/unit/policy/ tests/unit/utils/ + drift integration -> 1142 passed, 17 subtests passed.

Coverage of past + present drift hazards

Hazard	Test reference
Build-ID line spuriously reported as drift	`test_drift.py::test_normalize_strips_build_id_lines`
Inverse: real change near a Build-ID line not masked	`test_drift.py::test_normalize_does_not_mask_real_drift_near_build_id`
BOM / CRLF spurious drift	`test_drift.py::test_normalize__bom`, `_crlf`
Inverse: real change masked by BOM/CRLF strip	`test_drift.py::test_normalize_does_not_mask_real_drift_*`
Bare `apm audit` exits non-zero on drift (regression)	`test_drift_check_e2e.py::test_bare_audit_with_drift_exits_zero_but_ci_audit_exits_one`
`apm audit --ci` does NOT gate on drift (regression)	same
Stale cache after lockfile bump produces false drift	`test_cache_pin.py` (14 tests) + `test_drift_check_e2e.py::test_apm_install_writes_cache_pin_marker_for_each_remote_dep`
Cache pre-dating marker contract is never marked	`test_cache_pin.py::test_sync_markers_self_heals_caches_missing_marker` + integration counterpart
Marker tampering / corruption	`test_cache_pin.py::test_verify_*` (5 error paths)

Lint

uv run --extra dev ruff check src/ tests/   -> All checks passed!
uv run --extra dev ruff format --check src/ tests/   -> 681 files already formatted

github-actions · 2026-05-05T08:48:43Z

APM Review Panel: `ship_with_followups`

PR #1137 ships default-on integration drift detection in apm audit, eliminating bespoke CI bash glue and closing the gap between what apm.yml declares and what actually lives in apm_modules.

cc @danielmeppiel @sergio-sisternes-epam -- a fresh advisory pass is ready for your review.

The panel converges strongly: 77 tests, 21 scenario-evidence rows, and full E2E coverage make this one of the best-instrumented PRs in the project's history. No panelist found a correctness regression or security bypass. The two highest-signal gaps identified are tightly coupled: the _ReadOnlyProjectGuard in guards.py is defined but never wired into run_replay(), meaning the defense-in-depth no-write contract is unwired dead code; and CacheMissError raised during drift replay in bare apm audit is silently swallowed, giving the user zero feedback. Both concern the same promise -- that drift replay is safe and observable. These should be addressed as a paired follow-up, not a blocker, because the primary protection (_assert_scratch_bound) is in place and the E2E tests confirm the happy path.

The supply-chain-security finding on resolved_commit None silently skipping cache-pin verification for remote deps is the third-highest signal item, carrying a secure-by-default tag on an outcome: missing evidence row. Per the CEO evidence-weighting rules, a missing test on a secure-by-default surface inherits near-blocking weight even when the persona filed it as recommended. This should be the first follow-up issue filed post-merge. The doc-writer's findings on stale '7 baseline checks' copy in ci-cd.md, governance-guide.md, and security.md are straightforward and should be bundled into the same follow-up PR as the CHANGELOG link fix.

The oss-growth-hacker is correct that the before/after CI gate narrative is the strongest launch asset this project has shipped in recent memory. The CHANGELOG entry and drift-detection guide need a light rewrite pass before this becomes release-note source material -- but that is a post-merge content task, not a gate. Auth-expert is inactive and in full agreement: no auth surface was touched.

Aligned with: Secure by default -- drift runs by default; the scratch-bound guard and cache-pin verification enforce the no-mutation contract, though _ReadOnlyProjectGuard and the resolved_commit None silent-skip both need follow-up. Governed by policy -- apm audit --ci now gates on drift as well as lockfile consistency; the stale '7 baseline checks' copy must be updated. Pragmatic as npm -- the 5-line CI bash collapses to a single apm audit invocation; this is the pragmatic payoff the feature was designed to deliver.

Growth signal. The before/after CI gate narrative -- 5 lines of bash becoming one apm audit invocation -- is a screenshot-ready launch asset. To convert that asset into actual adoption signal, the drift-detection guide needs a '## Try it now' copy-paste block and the CHANGELOG entry needs to lead with the user promise ('no bash glue needed') before the flag semantics. Both are 10-minute edits that compound the PR's reach significantly.

Panel summary

Persona	R	N	Takeaway
Python Architect	1	2	Solid replay-engine pattern; `_ReadOnlyProjectGuard` is defined but never wired into `run_replay()` -- dead code.
CLI Logging Expert	2	1	`scratch_root()` stderr routing is correct (B4 resolved); modified-drift text color violates traffic light; 3 `CacheMissError` messages missing actionable fix.
DevX UX Expert	2	1	`CacheMissError` silently swallowed in bare `apm audit`; module docstring contradicts shipped feature.
Supply Chain Security Expert	2	1	`_assert_scratch_bound` bypasses canonical path guard API; `resolved_commit=None` silently skips cache-pin verification.
OSS Growth Hacker	3	1	Strong CI friction-reduction story; CHANGELOG reads as implementation prose, guide lacks a 60-second demo.
Doc Writer	2	3	Guide is PROSE-compliant; stale '7 baseline checks' in ci-cd.md/governance-guide.md understates new behavior; security.md has no drift mention.
Test Coverage Expert	1	2	Core drift scenarios and cache-pin error paths are well-covered; self-heal branch E2E reachability is unverifiable without running the test.

B = blocking-severity findings, R = recommended, N = nits.
Counts are signal strength, not gates. The maintainer ships.

Top 5 follow-ups

[Supply Chain Security Expert] Remote deps with resolved_commit=None silently skip cache-pin verification; add a CacheMissError or visible warning for this case, plus a unit test proving the behavior. -- outcome:missing on a secure-by-default surface inherits near-blocking weight per CEO evidence rules; this is the highest-risk silent fail-open in the PR.
[Python Architect] Wire _ReadOnlyProjectGuard into run_replay() -- it is currently defined in guards.py but never imported or called. -- The defense-in-depth no-write contract is dead code; the module docstring explicitly describes it as the intended wrapping pattern for run_replay().
[DevX UX Expert] Emit a stderr message when CacheMissError is swallowed in bare apm audit so users know drift was attempted and why it failed. -- Zero feedback on a stale-cache failure violates the failure-mode principle; confirmed by both devx-ux and cli-logging-expert independently.
[Doc Writer] Update ci-cd.md, governance-guide.md, and security.md to replace '7 baseline checks' with the new default-on drift count, and fix the CHANGELOG relative-path link. -- Public doc surfaces understate shipped behavior; governance and security docs are the first place enterprise evaluators look.
[OSS Growth Hacker] Rewrite the CHANGELOG entry to lead with the user promise, add a 'Try it now' block to the drift-detection guide, and add a section heading to the before/after CI comparison. -- The strongest launch asset in this PR is currently buried mid-page; a 10-minute edit multiplies reach before the release note is cut.

Architecture

classDiagram
    direction TB
    class audit_command {
        <<EntryPoint>>
        +--ci bool
        +--no-drift bool
    }
    class ReplayConfig {
        <<ValueObject>>
        +project_root Path
        +lockfile_path Path
        +targets frozenset
        +cache_only bool
    }
    class DriftFinding {
        <<ValueObject>>
        +path str
        +kind str
        +package str
        +inline_diff str
    }
    class CheckLogger {
        +replay_start()
        +diff_start()
        +replay_complete(n int)
        +clean()
        +findings(n int)
    }
    class _ReadOnlyProjectGuard {
        <<ContextManager>>
        +project_root Path
        +protected_roots list
        +__enter__()
        +__exit__()
    }
    class CacheMissError {
        <<Exception>>
    }
    class CachePinError {
        <<Exception>>
    }
    class run_replay {
        <<IOBoundary>>
        +config ReplayConfig
        +logger CheckLogger
    }
    class diff_scratch_against_project {
        <<Pure>>
        +scratch_root Path
        +project_root Path
    }
    class verify_marker {
        <<IOBoundary>>
    }
    class write_marker {
        <<IOBoundary>>
    }
    note for CheckLogger "Subclass pattern: redirects emit() to stderr so JSON/SARIF stdout stays clean"
    note for _ReadOnlyProjectGuard "Defined in guards.py; NOT wired into run_replay() -- dead code gap (follow-up #2)"
    note for run_replay "Replay Engine: cache-only materialization into scratch_root via integrate_package_primitives"
    run_replay ..> ReplayConfig : reads
    run_replay ..> CheckLogger : logs via
    run_replay ..> verify_marker : stale-cache guard
    run_replay ..> CacheMissError : raises
    run_replay ..> diff_scratch_against_project : calls
    diff_scratch_against_project ..> DriftFinding : returns list
    verify_marker ..> CachePinError : raises on mismatch
    _ReadOnlyProjectGuard ..> CacheMissError : raises
    audit_command ..> run_replay : invokes via _check_drift
    class ReplayConfig:::touched
    class DriftFinding:::touched
    class CacheMissError:::touched
    class CachePinError:::touched
    class CheckLogger:::touched
    class _ReadOnlyProjectGuard:::touched
    class run_replay:::touched
    class diff_scratch_against_project:::touched
    class verify_marker:::touched
    class write_marker:::touched
    classDef touched fill:#fff3b0,stroke:#d47600

flowchart TD
    A([apm audit]) --> B{--ci?}
    B -->|yes| C["_audit_ci_gate()\naudit.py:395"]
    B -->|no| D["_audit_content_scan()\naudit.py:552"]

    C --> C1["run_baseline_checks()\npolicy/ci_checks.py"]
    C --> C2{not --no-drift\nand apm.yml?}
    C2 -->|yes| DR["_check_drift()\npolicy/ci_checks.py"]
    C2 -->|no| C3["[stderr] drift skipped"]

    D --> D1["[FS] scan_lockfile_packages()"]
    D --> D2{not --no-drift and\napm.yml and no --file/--strip?}
    D2 -->|yes| DR
    D2 -->|no| D3["[stderr] drift skipped"]

    DR --> RP["run_replay(ReplayConfig)\ndrift.py:374"]
    RP --> RP1["[FS] _make_scratch_root()\ntempfile.mkdtemp + atexit.register"]
    RP --> RP2["_assert_scratch_bound()\npoint-in-time only"]
    RP --> RP3{for each lock_dep\nin LockFile}
    RP3 --> RP4["[FS] _materialize_install_path()\ndrift.py:198"]
    RP4 --> RP5["[FS] verify_marker(candidate, commit)\ncache_pin.py -- CachePinError->CacheMissError"]
    RP5 --> RP6["[FS] integrate_package_primitives()\nservices.py scratch_root=scratch"]
    RP6 --> RP3
    RP3 -->|all done| DIFF["diff_scratch_against_project()\ndrift.py:544"]
    DIFF --> FIND["list[DriftFinding]\nmodified / unintegrated / orphaned"]

    FIND --> EC{--ci mode?}
    EC -->|yes| EXIT1(["sys.exit 0 or 1\ndrift escalates in --ci"])
    EC -->|no| EXIT0(["sys.exit 0\ndrift advisory only"])

sequenceDiagram
    actor User
    participant CLI as audit.py
    participant DriftChk as ci_checks._check_drift
    participant Replay as drift.run_replay
    participant Pin as cache_pin.verify_marker
    participant Svc as services.integrate_package_primitives
    participant Diff as drift.diff_scratch_against_project

    User->>CLI: apm audit [--ci] [--no-drift]
    CLI->>DriftChk: _check_drift(project_root, lockfile, cache_only=True)
    DriftChk->>Replay: run_replay(ReplayConfig)
    Replay->>Replay: _make_scratch_root() -- tempfile.mkdtemp
    Replay->>Replay: _assert_scratch_bound() -- point-in-time check
    loop for each lock_dep
        Replay->>Pin: verify_marker(install_path, resolved_commit)
        alt marker ok
            Pin-->>Replay: ok
        else CachePinError
            Pin-->>Replay: raise CachePinError
            Replay-->>DriftChk: raise CacheMissError
        end
        Replay->>Svc: integrate_package_primitives(scratch_root=scratch)
        Svc-->>Replay: files written to scratch only
    end
    Replay-->>DriftChk: scratch_root Path
    DriftChk->>Diff: diff_scratch_against_project(scratch, project, lockfile)
    Diff-->>DriftChk: list[DriftFinding]
    DriftChk-->>CLI: (CheckResult, drift_findings)
    alt --ci mode
        CLI->>User: render + sys.exit(1) if drift
    else bare apm audit
        CLI->>User: render (advisory) + sys.exit(0)
    end

Recommendation

Merge now. The PR is well-tested, the primary safety guard is in place, and no panelist found a blocking defect. File three follow-up issues immediately post-merge: (1) wire _ReadOnlyProjectGuard + emit stderr on CacheMissError swallow (paired, same PR); (2) add resolved_commit=None warning/error with a secure-by-default unit test; (3) doc pass for stale check-count copy and CHANGELOG link. Track the growth-hacker content edits alongside the doc pass -- they are cheap and the launch window is now.

Full per-persona findings

Python Architect

[recommended] _ReadOnlyProjectGuard in guards.py is defined but never imported or called by run_replay() at src/apm_cli/utils/guards.py:60
The module docstring on guards.py explicitly states it is the defense-in-depth check for drift replay and shows a usage example wrapping run_replay(). Grepping src/ confirms _ReadOnlyProjectGuard has zero import or call sites outside its own file. The primary protection is _assert_scratch_bound() (drift.py:108), which is a point-in-time check at scratch dir creation -- it does NOT detect accidental real-project writes that could occur later. Leaving the guard unconnected means the no-write contract is enforced by convention only, not by a runtime assertion.
Suggested: In run_replay() (drift.py:424), wrap the inner replay loop: with _ReadOnlyProjectGuard(project_root, ['.apm', 'apm.lock.yaml', '.github']): for lock_dep in lock.get_all_dependencies(): ...
[nit] _inline_diff_for() returns empty string for in-range files; DriftFinding.inline_diff is always empty in practice at src/apm_cli/install/drift.py:532
drift.py:532-541 defines _inline_diff_for but only handles the too-large case. DriftFinding.inline_diff is therefore always '' or the size note, never actual diff content.
[nit] normalization.py exports underscore-prefixed private names in __all__ at src/apm_cli/utils/normalization.py:50
__all__ includes _BOM, _normalize, _normalize_line_endings, _strip_bom, _strip_build_id -- private-by-convention names. Mixed signals to future contributors and static analysis tools.

CLI Logging Expert

[recommended] Modified drift entries rendered yellow in text mode; traffic light rule requires red for error-severity findings at src/apm_cli/utils/diagnostics.py:462
diagnostics.py _render_drift_group renders modified, unintegrated, and orphaned drift items all with color='yellow'. The SARIF renderer correctly classifies modified as level='error', but the text/diagnostic path does not distinguish it. Hand-edited deployed files (modified) are the highest-severity drift case and should use red.
Suggested: color = 'red' if label == 'modified' else 'yellow'
[recommended] Three CacheMissError raises lack 'run apm install' actionable text at src/apm_cli/install/drift.py:221
Lines 221, 225 (drift.py) and 397 surface as user-visible errors but give no next step. Lines 232-234 and 246 correctly include 'run apm install'. Inconsistency means users hit the first two local-source errors and have no guidance.
Suggested: Lines 221/225: append '-- run apm install'. Line 397: append '; run apm install to regenerate it'.
[nit] --no-drift mutex UsageError message explains why but omits the fix at src/apm_cli/commands/audit.py:884
Current message names the flags but does not tell the user what to do instead.
Suggested: Add: 'Run each mode separately: apm audit --no-drift for content scan, apm audit for drift detection.'

DevX UX Expert

[recommended] CacheMissError from drift replay is silently swallowed in bare apm audit; user gets no output at src/apm_cli/commands/audit.py:676
In _audit_content_scan, when _check_drift raises CacheMissError it returns (CheckResult(passed=False, message='...'), []). drift_findings=[] so the if drift_findings: branch never fires, and drift_failed is explicitly discarded. The user gets zero feedback that drift was attempted and failed due to a stale cache. Violates: "Failure mode is the product."
Suggested: After drift_check, drift_findings = _check_drift(...), check if not drift_check.passed and not drift_findings: and emit the check message to stderr.
[recommended] Module docstring (lines 5-6) says drift detection is 'planned as future modes' and names the flag --drift; both are wrong after this PR at src/apm_cli/commands/audit.py:5
The docstring reads: "lock-file consistency (--ci) and drift detection (--drift) are planned as future modes." This PR ships drift as default-on with --no-drift (not --drift), making the docstring doubly wrong.
Suggested: Replace with accurate description of the shipped behavior.
[nit] Mutex error message reads as self-defeating: 'those modes do not run drift detection' implies --no-drift is redundant at src/apm_cli/commands/audit.py:884
Confusing for users who wonder why the error fires at all.

Supply Chain Security Expert

[recommended] _assert_scratch_bound uses Path.relative_to() directly instead of the canonical ensure_path_within guard at src/apm_cli/install/drift.py:108
The persona contract requires every path-containment check to flow through ensure_path_within so symlink resolution, Windows extended-prefix stripping, and single-choke-point invariant are consistently applied. _assert_scratch_bound calls project_root.resolve() / scratch_root.resolve() then Path.relative_to() -- equivalent in practice but outside the canonical chokepoint. If ensure_path_within's logic ever tightens, _assert_scratch_bound silently diverges.
Suggested: Add assert_path_outside(child: Path, base: Path) to path_security.py, then replace _assert_scratch_bound's body with a call to it.
[recommended] resolved_commit=None/empty silently skips cache-pin verification in _materialize_install_path at src/apm_cli/install/drift.py:240
Lines 240-246: verify_marker is gated on if lock_dep.resolved_commit:. A lockfile entry whose resolved_commit is None or empty string bypasses stale-cache detection entirely and proceeds to diff against whatever is in apm_modules -- silently. For remote deps, this should emit a visible warning or raise CacheMissError.
Suggested: For remote deps (source != 'local'), treat a missing resolved_commit as a CacheMissError or emit a visible warning.
Proof (missing at unit): tests/unit/install/test_drift.py::test_remote_dep_none_commit_warns_or_raises -- proves: A remote dep with no resolved_commit causes a CacheMissError or explicit warning rather than silently skipping cache-pin verification [secure-by-default]
[nit] sync_markers_for_lockfile uses bare except Exception swallowing all install-path resolution errors at src/apm_cli/install/cache_pin.py:152
Line 152 catches all exceptions to keep marker sync best-effort, but also masks AttributeError/TypeError from programming mistakes.
Suggested: Narrow to (AttributeError, ValueError, OSError) and log at DEBUG level.

OSS Growth Hacker

[recommended] CHANGELOG entry is implementation prose, not a user story; release-note audience needs the benefit first at CHANGELOG.md:12
The first bullet leads with flag semantics (JSON keys, SARIF rule IDs, mutual-exclusion rules) before naming the user benefit. Existing users scanning the changelog to decide whether to upgrade need 'you no longer need bash glue in CI' in the first sentence.
Suggested: Open with: 'apm audit now catches missed installs, hand-edits, and orphaned files in one command -- no bash glue needed. CI gates drop from a 5-line git-status script to apm audit --ci.' Then append the flag/schema detail.
[recommended] Drift detection guide has no 60-second runnable demo; a reader cannot feel the feature without an existing project at docs/src/content/docs/guides/drift-detection.md
The guide explains the feature well but never gives a reader a copy-paste block they can run right now and see real output.
Suggested: Add a '## Try it now' section: git clone https://github.com/microsoft/apm && cd apm && apm install && apm audit
[recommended] The CI before/after is a quotable launch beat but buried mid-page with no skimmable heading at docs/src/content/docs/guides/drift-detection.md:102
The before/after code block (5-line bash -> single apm audit --ci) is the strongest friction-reduction proof. Currently it appears at line 102 with no named callout, no framing sentence, and no section heading.
Suggested: Rename 'CI integration' to 'Before and after: CI gate in one line' and add a lead sentence.
[nit] README bullet 'catch hand-edits before they ship' narrows the hook to one of three drift kinds at README.md:84
Users who forget to re-run install (unintegrated) may not self-identify as 'hand-editors'.
Suggested: 'catches missed installs, hand-edits, and orphaned files before they ship'

Auth Expert -- inactive

No auth-relevant surface touched; drift detection is cache-only with no network calls, auth resolution, token management, or credential handling.

Doc Writer

[recommended] ci-cd.md and governance-guide.md still describe apm audit --ci as running '7 baseline checks' -- drift is now a default-on eighth check at docs/src/content/docs/integrations/ci-cd.md:150
ci-cd.md and governance-guide.md repeat the '7 baseline' count. Since drift detection runs by default in --ci and contributes to exit code, every one of these statements now understates the behavior.
Suggested: Change to '7 baseline lockfile checks plus integration drift detection (default-on), no configuration'. Update governance-guide.md similarly.
[recommended] enterprise/security.md describes apm audit only as a content-scanning tool; drift detection is not mentioned at docs/src/content/docs/enterprise/security.md:130
Per the two-layer security model, all security-related sections must describe apm audit correctly. security.md enumerates apm audit capabilities with no mention of drift.
Suggested: Add a cross-reference to the Drift Detection guide after the existing apm audit bullet list.
[nit] CHANGELOG link uses a repo-relative file path, not a rendered docs URL -- inconsistent with all other entries at CHANGELOG.md:12
Every other hyperlink in the CHANGELOG uses https:// external URLs. A file path is not clickable from a rendered GitHub CHANGELOG view.
Suggested: Replace with `(microsoft.github.io/redacted)
[nit] drift-detection.md exit code table: --no-drift row value 'governed only by other checks' is ambiguous at docs/src/content/docs/guides/drift-detection.md:55
The other two rows give explicit values (0, 1). The vague phrase breaks the table's parallelism.
Suggested: '0 (advisory) or 1 with --ci -- same as baseline checks without drift'
[nit] drift-detection.md flowchart names 'scratch tmpdir' -- raises a question about cleanup guarantees not addressed in the guide at docs/src/content/docs/guides/drift-detection.md:33
'tmpdir' implies a system temp directory, raising a question about whether cleanup is guaranteed.
Suggested: 'Replay install into scratch tree (auto-cleaned)'

Test Coverage Expert

[recommended] LockfileBuilder._sync_cache_pin_markers_from_disk self-heal branch lacks a confirmed explicit integration test
The E2E test test_apm_install_writes_cache_pin_marker_for_each_remote_dep may or may not exercise the _sync_cache_pin_markers_from_disk branch (vs the in-memory path). Without running the test it is not possible to confirm which branch fires. If the self-heal path is unexercised, a future refactor could silently break the upgrade story.
Suggested: Add a dedicated integration test: install a project, wipe all .apm-pin markers, run apm install without changing apm.yml, assert markers are re-created.
Proof (unknown at e2e): tests/integration/test_drift_check_e2e.py::TestDriftE2E::test_apm_install_writes_cache_pin_marker_for_each_remote_dep -- proves: apm install self-heals .apm-pin markers for pre-existing caches [secure-by-default, devx]
assert marker.exists(), 'install must self-heal pre-existing caches by writing the marker'
[nit] diagnostics._render_drift_group emits no remediation hint; no test guards against that omission at tests/integration/test_drift_check.py:711
No test asserts a recovery hint ('run apm install') in drift text output.
Suggested: Add assert 'apm install' in combined to test_e9_text_mode_drift_summary_includes_kind_and_path.
Proof (missing at integration-with-fixtures): tests/integration/test_drift_check.py::TestSectionENoDriftFlag::test_e9_text_mode_drift_summary_includes_kind_and_path
assert 'apm install' in combined # recovery action must be present in drift output
[nit] services.py scratch_root guard rejection path has no unit test at tests/unit/install/test_drift.py
Grepped tests/unit/ and tests/integration/ for scratch_root; no match covering the rejection case (project_root escaping scratch_root).
Suggested: Add a unit test asserting PathSecurityError is raised when project_root points outside scratch_root.
Proof (missing at unit): tests/unit/install/test_drift.py::test_scratch_root_guard_rejects_project_root_escaping_scratch
with pytest.raises(PathSecurityError): run_drift_replay(project_root=outside, scratch_root=scratch)

_{This panel is advisory. It does not block merge. Re-apply the

panel-review label after addressing feedback to re-run.}

Note

🔒 Integrity filter blocked 1 item

The following item was blocked because it doesn't meet the GitHub integrity level.

feat(audit): default-on integration drift detection #1137 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by PR Review Panel for issue #1137 · ● 5.8M · ◷

danielmeppiel · 2026-05-05T09:03:35Z

Acknowledging the follow-up batch from the panel synthesis (#issuecomment-4377781459). Working through these in disciplined order:

C1 Supply Chain (highest-risk): add CacheMissError/warning for remote deps with resolved_commit=None + unit test
C2 Python Architect: wire _ReadOnlyProjectGuard into run_replay()
C3 DevX UX + CLI Logging: stderr message when CacheMissError is swallowed in bare apm audit
C4 Doc Writer: update ci-cd.md / governance-guide.md / security.md baseline-check counts + CHANGELOG link fix
C5 OSS Growth: CHANGELOG user-promise lead, "Try it now" block, before/after section heading

Each will be implemented via the corresponding specialist agent (with test-coverage cross-validation on C1–C3). Posting progress as commits land.

danielmeppiel · 2026-05-05T09:03:55Z

(read-only)

C1 (supply-chain): Fail closed on unpinned remote deps - cache_pin.find_unpinned_remote_deps() helper + stderr warning in sync_markers_for_lockfile - drift._materialize_install_path raises CacheMissError for remote deps with resolved_commit=None (was silent fail-open) - Replaced silent-skip test with warning assertion + new helper test C2 (architecture): Wire _ReadOnlyProjectGuard into run_replay - run_replay() now wraps the deps loop with _ReadOnlyProjectGuard on governed root dirs + apm.lock.yaml + AGENTS.md - Regression test: monkeypatched leaky integrator triggers ProtectedPathMutationError C3 (cli-logging-ux): Stderr message on swallowed CacheMissError - audit._audit_content_scan emits '[!] drift check could not run: <msg>' to stderr when drift_failed and no findings (covers cache miss, missing lockfile, cache-pin error) - Integration test e10 asserts stderr message in bare-audit path C4 (docs): Baseline-check phrasing + CHANGELOG link - governance-guide, ci-cd, cli-commands now read '7 baseline checks plus integration drift detection' - CHANGELOG drift-detection link points to docs site URL C5 (oss-growth): User-promise framing - CHANGELOG drift entry leads with the user promise (forgotten installs + hand-edits) before mechanism - drift-detection.md gains a 'Try it now' block at the top - Before/after CI comparison promoted to its own subsection with explicit framing of what the bash workaround missed Verification: ruff check + format silent; 7621 unit tests + 27 drift integration tests green. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

danielmeppiel · 2026-05-05T09:15:45Z

Panel follow-ups C1-C5 addressed (commit `60bf38a`)

All five items from the panel review are implemented per the specialist guidance, with cross-validation across the relevant areas.

C1 -- Supply-chain: fail closed on unpinned remote deps

Specialist: supply-chain-security
Why it matters: A remote dep with resolved_commit=None (lockfile from older APM, branch-tracked dep, etc.) cannot have a cache-pin marker written -- so cache-freshness verification was being silently skipped. That's a fail-open on a secure-by-default surface.

Changes:

cache_pin.find_unpinned_remote_deps() -- new exported helper.
cache_pin.sync_markers_for_lockfile -- emits [!] stderr warning enumerating affected dep names.
drift._materialize_install_path -- raises CacheMissError for unpinned remote deps before scratch-deploy. Local deps and pinned remotes unaffected.

Tests: test_sync_markers_warns_on_remote_unpinned_dep, test_find_unpinned_remote_deps_excludes_local_and_pinned, test_materialize_unpinned_remote_dep_raises_cache_miss, test_materialize_local_dep_without_commit_does_not_raise.

C2 -- Architecture: wire `_ReadOnlyProjectGuard` into `run_replay`

Specialist: python-architect
Why it matters: The defense-in-depth no-write contract was dead code -- guards.py defined the guard but run_replay never wrapped its loop with it.

Changes:

run_replay() now wraps the deps-loop in _ReadOnlyProjectGuard(project_root, [*sorted(governed), "apm.lock.yaml", "AGENTS.md"]). Snapshots (mtime_ns, size) per file under protected subpaths and raises ProtectedPathMutationError on any mutation.
apm_modules/ and the scratch root remain mutable (legitimate cache writes / scratch deploys).

Test: test_run_replay_wraps_loop_with_readonly_guard -- monkeypatches an integrator to write into a governed dir and asserts ProtectedPathMutationError is raised.

C3 -- DevX UX + cli-logging: stderr message on swallowed `CacheMissError`

Specialists: devx-ux + cli-logging-expert (cross-validation per panel CEO)
Why it matters: Bare apm audit was silently returning no drift findings when the cache was stale -- zero feedback violated the failure-mode principle.

Changes: audit._audit_content_scan -- after _check_drift returns, when drift_failed and not drift_findings, emit [!] drift check could not run: <reason> on stderr (stdout payload unchanged for JSON consumers). Covers cache miss, missing lockfile, cache-pin error.

Test: test_e10_bare_audit_surfaces_cache_miss_on_stderr -- monkeypatches apm_cli.install.drift.run_replay to raise CacheMissError, asserts stderr contains drift check could not run and cache miss, asserts stdout JSON still parses.

C4 -- Docs: baseline-check count + CHANGELOG link

Specialist: doc-writer
Why it matters: Public doc surfaces understated shipped behavior. Enterprise evaluators read governance + security first.

Changes:

enterprise/governance-guide.md (3 locations), integrations/ci-cd.md (2 locations), reference/cli-commands.md (1 location): "7 baseline checks" -> "7 baseline checks plus integration drift detection" (or equivalent framing).
CHANGELOG.md: drift-detection link rewritten to canonical doc-site URL (relative path didn't render on GitHub or the doc site).

C5 -- OSS Growth: user-promise framing

Specialist: oss-growth-hacker
Why it matters: The strongest launch asset (drift-as-default) was buried mid-CHANGELOG and the docs page led with mechanism instead of outcome.

Changes:

CHANGELOG.md drift entry now leads with "apm audit now catches forgotten installs and hand-edits by default." before the mechanism description.
docs/.../guides/drift-detection.md:
- Try it now block at the top with cd <project> && apm audit and a quick guide to first-run results.
- Before/after CI comparison promoted to its own subsection (### Before vs after: the legacy bash workaround) with explicit framing that apm audit --ci catches three classes of drift the bash workaround missed (unintegrated, orphaned, modified).

Verification

uv run --extra dev ruff check src/ tests/ -- silent
uv run --extra dev ruff format --check src/ tests/ -- silent (681 files)
uv run pytest tests/unit -q -- 7621 passed in 81.94s
uv run pytest tests/unit/install/ tests/integration/test_drift_check.py -x -q -- 603 passed in 6.36s

Ready for re-review.

Collapse the two added entries (drift + cache-pin markers) into one short line that answers the developer 'so what?' and points to the Drift Detection guide for the full mechanism + opt-out + cache-pin details. Per maintainer feedback: the previous entries were too long for a CHANGELOG. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Roll forward the four PRs merged since v0.12.1: - #1137 feat(audit): default-on integration drift detection - #1135 fix(deps): subdir-agnostic bare cache (parallel sparse-checkout race) also resolves duplicate report #1140 (ADO sub-path manifestation) - #1129 docs: first-package guide accuracy (includes: auto, skill paths) - #1127 docs: APM's role for skills, plugins-as-packages, ADO sub-paths Bump pyproject.toml + uv.lock 0.12.1 -> 0.12.2 and convert the Unreleased CHANGELOG block into the 0.12.2 section, with a single 'so what' line per merged PR per the changelog contract. Co-authored-by: Daniel Meppiel <copilot-rework@github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ocks v0.12.2 release) (#1142) * fix(install): exclude .apm-pin marker from package content hash Re-installing any package previously installed by APM 0.12.2 erased its apm_modules/ directory: the .apm-pin cache marker (added by #1137 for drift-replay) is written AFTER hash recording, so the on-disk hash diverged from the lockfile hash on every subsequent install, falsely tripping the supply-chain content-hash mismatch check and rmtree'ing the cache. Surfaced by tests/integration/test_guardrailing_hero_e2e.py, which runs 'apm install <virtual-file>' twice via run_command(show_output= True) and asserts apm_modules/github/<flattened-name> still exists -- the v0.12.2 release tag failed in CI on every platform on this test. Fix: exclude .apm-pin from compute_package_hash via a new _EXCLUDED_FILES set, mirroring _EXCLUDED_DIRS. Add regression test test_skips_apm_pin_marker. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(install): scope .apm-pin exclusion to package root Address Copilot review feedback on PR #1142: a basename match would exclude any .apm-pin file at any depth, so a malicious package could hide bytes under subdir/.apm-pin to bypass the integrity hash. The cache-pin marker is only ever written to the package root by cache_pin.write_marker, so scope the exclusion accordingly. Extend the regression test to assert nested .apm-pin files are still hashed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Daniel Meppiel <copilot-rework@github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Daniel Meppiel and others added 4 commits May 4, 2026 22:53

danielmeppiel marked this pull request as ready for review May 4, 2026 21:53

danielmeppiel requested a review from sergio-sisternes-epam as a code owner May 4, 2026 21:53

Copilot AI review requested due to automatic review settings May 4, 2026 21:53

Copilot started reviewing on behalf of danielmeppiel May 4, 2026 21:54 View session

Copilot AI reviewed May 4, 2026

View reviewed changes

Daniel Meppiel and others added 3 commits May 5, 2026 09:53

danielmeppiel added the panel-review Trigger the apm-review-panel gh-aw workflow label May 5, 2026

danielmeppiel mentioned this pull request May 5, 2026

Epic: Close the producer-produced drift-detection gap (per-asset lockfile + skill-as-package + apm audit --drift) #898

Closed

8 tasks

github-actions Bot removed the panel-review Trigger the apm-review-panel gh-aw workflow label May 5, 2026

danielmeppiel merged commit 6e31bc5 into main May 5, 2026
11 checks passed

danielmeppiel deleted the feat/drift-detection branch May 5, 2026 09:23

danielmeppiel mentioned this pull request May 5, 2026

chore(release): v0.12.2 #1141

Merged

danielmeppiel mentioned this pull request May 5, 2026

fix(install): exclude .apm-pin marker from package content hash (unblocks v0.12.2 release) #1142

Merged


		### Added

		- `apm audit` now detects integration drift by default. Read-only cache-only install replay catches missed `apm install` runs, hand-edited deployed files, and orphaned files. Findings exposed in JSON (`drift` key) and SARIF (`apm/drift/<kind>` rules); in `--ci` mode they contribute to the exit code. Opt out with `--no-drift` (mutually exclusive with `--strip`/`--file`). See the [Drift Detection guide](docs/src/content/docs/guides/drift-detection.md) for details. (#1071, supersedes scope of #898)

Conversation

danielmeppiel commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

feat(audit): default-on integration drift detection

TL;DR

Problem (WHY)

Approach (WHAT)

Implementation (HOW)

Diagrams

Trade-offs

Benefits

Validation

Scenario evidence — what is tested to work

Scenario evidence — what is tested NOT to regress

How to test

Uh oh!

danielmeppiel commented May 4, 2026

APM Review Panel: ship_with_followups

Panel summary

Top 5 follow-ups

Architecture

Recommendation

Python Architect

CLI Logging Expert

DevX UX Expert

Supply Chain Security Expert

OSS Growth Hacker

Auth Expert -- inactive

Doc Writer

Test Coverage Expert

Uh oh!

danielmeppiel commented May 4, 2026

Addressed in-PR (commit 7907865)

Tracked as follow-up issues (CEO call)

Status

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Copilot's findings

Uh oh!

danielmeppiel commented May 5, 2026

Follow-ups landed (3 commits on top of feat/drift-detection)

Commit map

Cache-pin marker contract (B1, supply-chain follow-up)

Proof of integration test coverage

Coverage of past + present drift hazards

Lint

Uh oh!

github-actions Bot commented May 5, 2026

APM Review Panel: ship_with_followups

Panel summary

Top 5 follow-ups

Architecture

Recommendation

Python Architect

CLI Logging Expert

DevX UX Expert

Supply Chain Security Expert

OSS Growth Hacker

Auth Expert -- inactive

Doc Writer

Test Coverage Expert

Uh oh!

danielmeppiel commented May 5, 2026

Uh oh!

danielmeppiel commented May 5, 2026

Uh oh!

danielmeppiel commented May 5, 2026

Panel follow-ups C1-C5 addressed (commit 60bf38a)

C1 -- Supply-chain: fail closed on unpinned remote deps

C2 -- Architecture: wire _ReadOnlyProjectGuard into run_replay

C3 -- DevX UX + cli-logging: stderr message on swallowed CacheMissError

C4 -- Docs: baseline-check count + CHANGELOG link

C5 -- OSS Growth: user-promise framing

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

danielmeppiel commented May 4, 2026 •

edited

Loading

APM Review Panel: `ship_with_followups`

Addressed in-PR (commit `7907865`)

Follow-ups landed (3 commits on top of `feat/drift-detection`)

APM Review Panel: `ship_with_followups`

Panel follow-ups C1-C5 addressed (commit `60bf38a`)

C2 -- Architecture: wire `_ReadOnlyProjectGuard` into `run_replay`

C3 -- DevX UX + cli-logging: stderr message on swallowed `CacheMissError`